ABSTRACT


This paper presents a statistical procedure (denoted by SIB) designed to test for
unidirectional test bias existing simultaneously in several items of an ability test. It was
argued in Shealy and Stout (1991) that in order to model such bias with an IRT model, a
multidimensional model is necessary. The proposed procedure, based on this multidimensional
IRT modeling approach, statistically tests for bias in one or more items at a time
and is corrected for the inflation (or deflation) of the test statistic due to target ability
difference, a valid group difference that is conceptually independent of psychological test
bias. The correction plays the same role as the practice of including the single studied
item in the “matching criterion” score in the Mantel-Haenszel (MH) procedure adapted
for test responses by Holland and Thayer (1988). It is shown through the initial portion of
an extensive simulation study underway (Shealy (1991)) that, with the correction in place,
the procedure performs as well as the MH procedure in many cases when there is a single
biased item, and performs well in the case of multiple item test bias.


Key Words: item bias, test bias, DIF, latent trait theory, item response theory, target ability,
valid subtest, nuisance determinants, potential for bias, expressed bias, unidirectional
test bias, bidirectional test bias, SIB, Mantel-Haenszel.

INTRODUCTION


The  purpose  of  this  paper  is  to  present  a  statistical  procedure  (denoted  by  SIB  for 
simultaneous item bias) for detecting bias present in one or more test items of a standardized
ability test. The procedure is based on the multidimensional item response theory
(IRT)  model  of  test  bias  presented  in  Shealy  and  Stout  (1991).  By  “test  bias”  we  mean 
a  formalization  of  the  intuitive  idea  that  a  test  is  less  valid  for  one  group  of  examinees 
than for another group in its attempt to assess examinee differences in a prescribed latent
trait, such as mathematics ability. Test bias is conceptualized herein as the result of
individually-biased  items  acting  in  concert  through  a  test  scoring  method,  such  as  number 
correct,  to  produce  a  biased  test. 

Two  distinct  features  of  this  conceptualization  of  bias  are  as  follows.  First,  it  provides 
a  mechanism  for  explaining  how  several  individually-biased  items  can  combine  through  a 
test score to exhibit a coherent and major biasing influence at the test level. In particular,
this can be true even if each individual item displays only a minor amount of item
bias. For example, word problems on a mathematics test that are too dependent on sophisticated
written English comprehension could combine to produce pervasive test bias
against  English-as-a-second-language  examinees.  A  second  feature,  possible  because  of  our 
multidimensional  modeling  approach,  is  that  the  underlying  psychological  mechanism  that 
produces bias is addressed. This mechanism lies in the distinction made between the ability
the test is intended to measure, called the target ability, and other abilities influencing
test performance that the test does not intend to measure, called nuisance determinants.
Test  bias  will  be  seen  to  occur  because  of  the  presence  of  nuisance  determinants  possessed 
in  differing  amounts  by  different  examinee  groups.  Through  the  presence  of  these  nuisance 
determinants,  bias  then  is  expressed  in  one  or  more  items. 

The  test  bias  detection  procedure  can  simultaneously  assess  bias  in  several  items, 
thus  addressing  the  above  two  features.  In  contrast,  most  item  bias  procedures  detailed 
in  the  literature  perform  tests  on  a  single  item  at  a  time:  The  pseudo  IRT  procedure 
of  Linn  and  Harnish  (1981)  estimates  possibly  group-dependent  item  response  functions 
(IRFs)  without  the  use  of  item  parameter  estimation  algorithms  when  the  sample  size  is 
too  small  for  their  use.  Thissen,  Steinberg,  and  Wainer  (1988)  employ  marginal  maximum 
likelihood  estimation  to  obtain  group-dependent  item  parameters  in  a  3-parameter  logistic 
framework  and  use  the  likelihood  ratio  test  to  test  the  equality  of  the  parameters  across 
group.  The  Mantel-Haenszel  procedure,  adapted  for  test  response  data  by  Holland  and 
Thayer  (1988),  and  which  is  in  wide  use,  employs  the  practice  of  using  the  score  of  the 
entire  test  instead  of  the  score  of  the  non-studied  items  as  the  “matching  criterion”  to  test 
for item bias. Other single-item procedures exist as well. Conceivably these procedures could be used once for each item in a set
of  items  being  tested  for  bias,  and  multiple  comparison  procedures  could  be  employed  to 
assess the hypothesis of the entire set being biased. However, if the amount of bias is small
in each item, a multiple comparison procedure may not pick up bias in the set of items at
all. Moreover, this approach cannot address underlying causal mechanisms of bias.

The  novelty  of  our  approach  to  detecting  test  bias  lies  not  so  much  with  its  recognition 
of  the  role  of  nuisance  determinants  in  the  expression  of  test  bias,  but  rather  in  its  explicit 
use  of  a  multidimensional  model  to  motivate  the  procedure  to  detect  it.  The  presence  of 
multidimensionality  of  test  item  responses  where  bias  is  present  has  long  been  recognized 
in  test  and  item  bias  studies:  Lord  (1980)  states  “if  many  of  the  items  [in  a  test]  are  found 
to  be  seriously  biased,  it  appears  that  the  items  are  not  strictly  unidimensional”  (p.  220). 
Recently,  Lautenschlager  and  Park  (1988)  employed  a  technique  of  generating  simulated 
biased  item  responses  using  a  method  of  Ansley  and  Forsyth  (1985),  which  involves  using 
multidimensional item response functions (IRFs) and latent ability distributions to determine
conditional probabilities of correct response. Kok (1988), taking a multidimensional
viewpoint  similar  to  Shealy  and  Stout  (1991),  presents  a  specific  multidimensional  IRT 
model  for  bias  where  the  nuisance  determinants  are  compensating  abilities,  contextual 
abilities  such  as  language,  and  testwiseness. 

An important issue addressed by our procedure is that a careful distinction is made between
genuine test bias, often operationally embodied as DIF (Holland and Thayer (1988))
by practitioners, and non-bias differences in examinee group performance, sometimes called
impact  (see,  for  example,  Ackerman  (1991)  for  a  careful  discussion  of  impact  as  distinct 
from  bias),  that  are  caused  by  examinee  group  differences  in  target  ability  distributions. 
It  is  important  that  the  latter  not  be  mistakenly  labeled  as  test  bias.  The  procedure 
developed herein makes this distinction in its application.

FORMULATION OF TEST BIAS


Test  bias  in  this  paper  is  modeled  using  a  multidimensional  item  response  theory 
(IRT)  model,  which  is  assumed  to  be  the  model  behind  the  observed  test  responses.  For 
purposes  of  exposition,  we  restrict  ourselves  to  the  case  where  there  is  a  single  nuisance 
determinant; this two-dimensional modeling approach is often realistic in practice. Extensions
to multiple nuisance determinants are straightforward. For a fuller treatment of the
conception  of  test  bias,  including  the  case  of  multiple  nuisance  determinants  and  item  bias 
cancellation,  in  a  more  general  framework,  see  Shealy  and  Stout  (1991)  and  Shealy  (1989). 

We  consider  two  biologically-  or  sociologically-defined  groups,  named  “reference”  and 
“focal”  groups  (after  Holland  and  Thayer’s  (1988)  naming  convention).  A  random  sample 
of  examinees  is  drawn  from  each  group,  and  a  test  of  N  items  is  administered  to  them. 
Typically  it  is  suspected  that  a  part  of  the  test  is  biased  against  the  focal  group;  this 
group  is  usually  the  object  of  the  bias  study.  The  responses  to  the  test  items  from  a 
randomly-chosen examinee are denoted $\mathbf{U} = (U_1, \ldots, U_N)$, where each $U_i$ can take on
0 or 1, according as the response to item $i$ is incorrect or correct, respectively.

The IRT model in general is composed of two components that generate $\mathbf{U}$: (1) a
$d$-dimensional examinee ability parameter and (2) a set of item response functions (IRFs), one
for each item, which determine the probability of correct response for the items. Here we
restrict the model to have $d = 1$ or 2, because we are considering a single nuisance determinant
in addition to the target ability. The ability vector is $(\theta, \eta)$ for an arbitrary examinee
from either group, where $\theta$ denotes target ability and $\eta$ denotes the nuisance determinant.
A distribution of $(\theta, \eta)$ over the combined group of examinees is induced by choosing
examinees at random; the variable for a randomly chosen examinee is denoted $(\Theta, \eta)$. The
IRF for item $i$ is denoted $P_i(\theta, \eta)$, and it is assumed that all items depend on $\theta$, and one
or more may depend on $\eta$; for those dependent only on $\theta$, the IRF is $P_i(\theta)$. It is implicitly
assumed that an IRT representation for $\mathbf{U}$ in terms of $(\Theta, \eta)$ and $\{P_i(\theta, \eta) : i = 1, \ldots, N\}$
is possible; for a fuller treatment of this assumption, see Shealy (1989). In addition, it is
assumed that each $P_i(\theta, \eta)$ is increasing in $(\theta, \eta)$ when item $i$ is dependent on both abilities
and increasing in $\theta$ when it is dependent on $\theta$ alone; and that each $P_i(\theta)$ is differentiable.
Finally, local independence of $\mathbf{U}$ given $(\Theta, \eta)$ is assumed.
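As an illustration only, the two model components can be sketched in Python. The compensatory logistic form of the IRF, the function names `irf` and `simulate_examinee`, and all parameter values are assumptions made for this sketch; the text above requires only that each IRF be increasing in the abilities on which the item depends and that responses be locally independent given the abilities.

```python
import math
import random

def irf(theta, eta, a_theta=1.0, a_eta=0.0, b=0.0):
    """Hypothetical compensatory logistic IRF P_i(theta, eta).

    It is increasing in theta, and increasing in eta whenever a_eta > 0.
    With a_eta = 0 the item depends on the target ability alone.
    """
    return 1.0 / (1.0 + math.exp(-(a_theta * theta + a_eta * eta - b)))

def simulate_examinee(theta, eta, items, rng):
    """Draw U = (U_1, ..., U_N) for one examinee.

    Local independence: given (theta, eta), each item is an independent
    Bernoulli trial with success probability P_i(theta, eta).
    """
    return [1 if rng.random() < irf(theta, eta, *item) else 0 for item in items]

# Assumed item parameters (a_theta, a_eta, b). The first three items depend
# on theta only; the last item also depends on the nuisance determinant eta.
ITEMS = [(1.0, 0.0, -0.5), (1.2, 0.0, 0.0), (0.8, 0.0, 0.5), (1.0, 0.7, 0.0)]

rng = random.Random(0)
responses = simulate_examinee(theta=0.3, eta=-0.2, items=ITEMS, rng=rng)
print(responses)
```

Any other monotone IRF family could be substituted for the logistic form without changing the structure of the sketch.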

Test  bias  in  the  above-mentioned  model  is  formulated  through  three  components: 

(a) The potential for bias, if it exists, resides within the target ability/nuisance determinant
distributions of the two groups being studied;

(b) potential for bias is expressed in items whose responses depend on the nuisance
determinant;1 and

(c) the scoring method of the test, to be viewed as an estimate of target ability, transmits
expressed item biases into test bias.

1 We remark that Kok’s (1988) formulation is also based upon (a) and (b); Kok’s and
our formulation were developed independently of one another.

Potential for test bias is explained prosaically in the following manner. After conditioning
on a particular $\theta$, suppose that the reference group has a higher level of nuisance
ability on average than the focal group. Then those reference group examinees with ability
$\theta$ would have an overall advantage over the corresponding focal group examinees when
responding to items at least partially dependent on the nuisance determinant $\eta$ (formally,
because of the monotonicity of the item IRFs $P_i(\theta, \eta)$). Formally, we define the potential
for test bias at $\theta$:

Definition 1. Potential for test bias exists against the focal group at target ability level $\theta$
with respect to $\eta$ if $\eta \mid \Theta = \theta, G = F$ is stochastically less than $\eta \mid \Theta = \theta, G = R$, where
“$G = F$” denotes sampling from the focal group and “$G = R$” sampling from the reference
group. Potential for bias exists against the reference group if the converse holds.

Note that we are restricting consideration to conditional nuisance distributions
$\eta \mid \Theta = \theta, G = R$ and $\eta \mid \Theta = \theta, G = F$ that are stochastically ordered; that is, where the
two distribution functions do not intersect. Figure 1 displays two distributions that are
stochastically ordered and also two distributions that are not.
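The ordering condition can be checked numerically for any two candidate distributions. The sketch below is a hypothetical illustration: the normal distributions, the grid, and the helper `stochastically_less` are assumptions of the example, and the check is made only at the grid points. It treats one distribution as stochastically less than another when its distribution function is nowhere below the other's.

```python
import math

def normal_cdf(x, mean, sd):
    """Distribution function of a normal(mean, sd) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def stochastically_less(cdf_low, cdf_high, grid):
    """True if cdf_low(x) >= cdf_high(x) at every grid point.

    That is the grid-based version of "the distribution with cdf_low is
    stochastically less than the distribution with cdf_high".
    """
    return all(cdf_low(x) >= cdf_high(x) for x in grid)

grid = [-6.0 + 0.05 * j for j in range(241)]

# Assumed example 1: equal spread, shifted mean. The distribution functions
# never cross, so the two distributions are stochastically ordered.
focal = lambda x: normal_cdf(x, -0.5, 1.0)
reference = lambda x: normal_cdf(x, 0.0, 1.0)
ordered = stochastically_less(focal, reference, grid)

# Assumed example 2: equal mean, different spread. The distribution
# functions cross at the common mean, so neither ordering holds.
wide = lambda x: normal_cdf(x, 0.0, 2.0)
either_ordering = (stochastically_less(wide, reference, grid)
                   or stochastically_less(reference, wide, grid))

print(ordered, either_ordering)
```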


place  Figure  1  about  here 


In order for test bias to occur, it must be expressed in one or more items. Our definition
of expressed bias for an item, when specialized to Kok’s model, is really the same as that
of Kok (1988, p. 269). It is defined in terms of a marginalization of the multidimensional
IRF $P_i(\theta, \eta)$.

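Definition 2. Let $P_i(\theta, \eta)$ be the IRF for item $i$. The marginal IRF for group $g$ ($g = R$
or $F$) with respect to target ability $\theta$ is defined as

$$T_{ig}(\theta) = E[P_i(\Theta, \eta) \mid \Theta = \theta, G = g]. \tag{1}$$

When $\eta \mid \theta$ has a conditional density for group $g$, $f_g(\eta \mid \theta)$ say, Definition 2 translates into

$$T_{ig}(\theta) = \int_{-\infty}^{\infty} P_i(\theta, \eta)\, f_g(\eta \mid \theta)\, d\eta.$$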

Definition 3. Expressed bias for item $i$ against the focal group occurs at target ability $\theta$
if $T_{iF}(\theta) < T_{iR}(\theta)$; it occurs against the reference group if the converse holds.
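Definitions 2 and 3 can be made concrete with a small numerical sketch. Everything specific here is an assumption chosen for illustration: the logistic IRF, the normal conditional nuisance densities (taken not to depend on the target ability), the group means, and the midpoint-rule integration in the hypothetical helper `marginal_irf`.

```python
import math

def irf(theta, eta, a_theta=1.0, a_eta=0.7, b=0.0):
    """Assumed logistic IRF for an item that depends on both abilities."""
    return 1.0 / (1.0 + math.exp(-(a_theta * theta + a_eta * eta - b)))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def marginal_irf(theta, eta_mean, eta_sd=1.0, half_width=8.0, steps=2000):
    """Midpoint-rule approximation of the marginal IRF at theta.

    Integrates P_i(theta, eta) against an assumed normal(eta_mean, eta_sd)
    conditional density of the nuisance determinant.
    """
    lo = eta_mean - half_width * eta_sd
    h = 2.0 * half_width * eta_sd / steps
    total = 0.0
    for j in range(steps):
        eta = lo + (j + 0.5) * h
        total += irf(theta, eta) * normal_pdf(eta, eta_mean, eta_sd)
    return total * h

# Assumed potential for bias: the focal group is lower on the nuisance
# determinant at every theta, so the conditional distributions are ordered.
ETA_MEAN = {"R": 0.0, "F": -0.5}

for theta in (-1.0, 0.0, 1.0):
    t_r = marginal_irf(theta, ETA_MEAN["R"])
    t_f = marginal_irf(theta, ETA_MEAN["F"])
    print(f"theta={theta:+.1f}  T_iR={t_r:.4f}  T_iF={t_f:.4f}  bias against F: {t_f < t_r}")
```

Under these assumptions the focal marginal IRF lies below the reference marginal IRF at each target ability examined, which is the situation Definition 3 calls expressed bias against the focal group.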

A test can consist of many items simultaneously biased by the same nuisance determinant.
In this case, items can cohere and act through the prescribed test score to produce
substantial bias against a particular group even if individual items display undetectably
small amounts of item bias. This is the final (and novel) component of our formulation of
test bias mentioned above. We consider the large class of test scores of the form

$$h(\mathbf{U}) \tag{2}$$

where $h(\mathbf{u})$ is real valued with domain $\mathbf{u} = (u_1, \ldots, u_N)$ such that $u_i = 0$ or 1 for
$i = 1, \ldots, N$ and $h(\mathbf{u})$ is coordinatewise non-decreasing in $\mathbf{u}$. This class contains many of
the standard scoring procedures for many standard models; for example, number correct,
linear formula scoring of the form $\sum_{i=1}^{N} a_i U_i$ with $a_i > 0$, maximum likelihood estimation
of ability for certain logistic models with item parameters assumed known, etc. In this
paper we restrict attention to number correct as the test score; the results presented herein
are easily extendable to other forms of $h(\mathbf{u})$. The key point about number correct scoring
is that each item is weighted equally. Thus, if a subset of the items is suspected of bias,
we should give equal weight to the items in this “studied” subtest in our attempt to
quantitatively assess the amount of test bias resulting from the simultaneous influence of
these items. We thus define test bias for a specified studied subtest of items as follows:

Definition 4. Let $\{U_{i_1}, \ldots, U_{i_b}\}$ be any subtest of items to be studied for bias from
the test of concern and define

$$h(\mathbf{U}) = \sum_{j=1}^{b} U_{i_j}. \tag{3}$$

Then this studied subtest of items displays test bias against the focal group at $\theta$ if

$$E[h(\mathbf{U}) \mid \Theta = \theta, G = F] < E[h(\mathbf{U}) \mid \Theta = \theta, G = R].$$

The subtest is biased against the reference group if the converse holds.

Finally, the components of the bias formulation can be integrated using the following
theorem, adapted from Theorem 4.2 in Shealy and Stout (1991):

Theorem 1. Fix a target ability $\theta$ and choose the subtest scoring method $h(\mathbf{u})$ of the
form (3). Assume potential for bias against the focal group at $\theta$ holds (Definition 1). Then
test bias exists against the focal group; i.e.,

$$\sum_{j=1}^{b} E[U_{i_j} \mid \Theta = \theta, G = F] < \sum_{j=1}^{b} E[U_{i_j} \mid \Theta = \theta, G = R]. \tag{4}$$

In order to test for bias of the above form, there must be an implicit assumption that a
portion of the test measures only the target ability; otherwise, a conditional-on-observed-score
procedure to detect bias is not possible. This set of items will be denoted the valid
subtest.  The  issue  of  the  existence  and  identification  of  a  valid  subtest  is  extremely  difficult 
to  frame  philosophically  (it  is  really  an  issue  of  construct  validity)  and  must  primarily  be 
an  empirical  decision  based  on  expert  opinion  or  data  at  least  in  part  external  to  the  test 
being  studied;  it  is  not  dealt  with  here.  For  a  fuller  discussion,  see  Shealy  and  Stout  (1991). 
For notational simplicity we take the valid subtest to consist of the first $n < N$ items of
the test, and we call the remaining $N - n$ items the studied subtest. We note that
use  of  a  valid  subtest  is  operationally  equivalent  to  making  use  of  a  subset  of  items  whose 
purpose  is  to  partition  examinees  into  “comparable”  sets  as  is  done  in  the  MH  procedure 
described  below  and  other  DIF  procedures.  Hence,  the  proposed  use  of  a  valid  subtest  in 
the  SIB  procedure  can  be  interpreted  either  in  the  strong  sense  of  our  test  bias  paradigm 
or  in  the  weak  sense  of  the  DIF  paradigm  (of  matching  of  “comparable”  examinees).  Thus 
use  of  our  statistical  procedure  for  assessing  bias  in  no  way  requires  acceptance  of  our  bias 
framework  as  opposed  to  a  “comparability”  framework,  where  no  claims  about  “bias”  are 
made. 

Using the above conventions, the specification of test bias against the focal group at
$\theta$ becomes

$$T_F(\theta) = \sum_{i=n+1}^{N} T_{iF}(\theta) < \sum_{i=n+1}^{N} T_{iR}(\theta) = T_R(\theta) \tag{5}$$

because $T_{ig}(\theta) = E[U_i \mid \Theta = \theta, G = g]$ by a simple application of a standard conditioning
formula to Definition 2. $T_g(\theta)$ is called the studied subtest response function for group $g$.

Unidirectional  test  bias 

Test bias heretofore has been considered conditional on a single target ability; we now
turn to a global perspective. If there is test bias against the same group for all $\theta$, then
there is unidirectional bias against this group. Specifically, if

$$B(\theta) = T_R(\theta) - T_F(\theta)$$

is the level of bias against Group F at $\theta$, then unidirectional bias holds if either $B(\theta) > 0$
for all $\theta$ or $B(\theta) < 0$ for all $\theta$. A strong form of unidirectional bias, termed uniform
bias by Mellenbergh (1982), is the type of bias that the modified Mantel-Haenszel test
statistic devised by Holland and Thayer (1988) is designed to detect. Although the
Mantel-Haenszel approach is not dependent on an IRT framework, it can be put in a Rasch
model IRT framework, with the single biased item having group-dependent item difficulties.
Here, the bias is “uniform” in the sense that $T_F(\theta)$ is merely $T_R(\theta)$ shifted horizontally.
Unidirectional bias is less restrictive in that $T_g(\theta)$ does not have to be a logistic IRF, and
more importantly, $T_R(\theta)$ does not have to be $T_F(\theta)$ shifted.

Since we are concerned with bias against the focal group, it is intuitive that a suitable
theoretical unidirectional bias index is

$$\beta_U = \int_{-\infty}^{\infty} B(\theta)\, f_F(\theta)\, d\theta \tag{6}$$

where $f_F(\theta)$ is the probability density function of $\Theta$ for the focal group. Equivalent indices
weighted by the reference target ability distribution and the combined-group target
distribution are easily conceptualized.
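A focal-weighted index of this kind can be evaluated numerically once the two studied subtest response functions and the focal target ability density are specified. The following sketch assumes, purely for illustration, a one-item studied subtest with a logistic response function, uniform bias in the form of a horizontal shift of 0.5, and a standard normal focal density; the helper `bias_index` and the midpoint rule are choices of the example.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bias_index(t_ref, t_foc, focal_mean=0.0, focal_sd=1.0, half_width=8.0, steps=4000):
    """Midpoint-rule approximation of the focal-weighted integral of
    B(theta) = T_R(theta) - T_F(theta), with an assumed normal focal density."""
    lo = focal_mean - half_width * focal_sd
    h = 2.0 * half_width * focal_sd / steps
    total = 0.0
    for j in range(steps):
        theta = lo + (j + 0.5) * h
        total += (t_ref(theta) - t_foc(theta)) * normal_pdf(theta, focal_mean, focal_sd)
    return total * h

# Assumed uniform bias: the focal response function is the reference
# response function shifted horizontally by 0.5.
t_ref = logistic
beta_shift = bias_index(t_ref, lambda theta: logistic(theta - 0.5))

# No bias: identical response functions give an index of zero.
beta_none = bias_index(t_ref, t_ref)

print(round(beta_shift, 4), beta_none)
```

Replacing the focal density in `bias_index` by the reference or combined-group density gives the alternative weightings mentioned above.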

THE  BASIC  PROCEDURE 

The statistical procedure to be presented is based on (6); the hypothesis is

$$H : \beta_U = 0 \quad \text{vs.} \quad \beta_U > 0,$$

the alternative being one-sided to specifically test for bias against the focal group. The
test statistic to be constructed is essentially an estimate of $\beta_U$ normalized to have unit
variance. The estimate of $\beta_U$ is derived first.

Since test bias is analyzed using number correct on the studied subtest, set

$$Y = \sum_{i=n+1}^{N} U_i \tag{7}$$

to be the studied subtest score; also set $X = \sum_{i=1}^{n} U_i$ to be the valid subtest score. In
selecting  the  valid  subtest  score  to  be  number  correct,  we  follow  the  convention  set  out  in 
Holland  and  Thayer  (1988),  among  many  others.  Other  choices  would  of  course  be  possible 
and  could  improve  the  performance  of  the  procedure. 

The naive intuition is that examinees with the same valid subtest score are examinees
of approximately equal target ability and thus such examinees are directly comparable in
the assessment of bias. Thus the difference

$$\bar Y_{Rk} - \bar Y_{Fk}, \qquad k = 0, \ldots, n, \tag{8}$$

where $\bar Y_{gk}$ is the average $Y$ for all examinees in group $g$ attaining valid subtest score $X = k$,
should provide a measure of the bias against the focal group (resulting from the reference
group having superior nuisance ability $\eta$ on average). In particular, if there is no bias ($H$
holds), then $\bar Y_{Rk} - \bar Y_{Fk} \approx 0$ for all $k$ should be observed, and if there is unidirectional
bias against the focal group ($B(\theta) > 0$ for all $\theta$) then $\bar Y_{Rk} - \bar Y_{Fk} > 0$ for all $k$, except for
statistical error, should be observed.

The above assertion needs support; it will suffice to argue that

$$E[\bar Y_{Rk} - \bar Y_{Fk}] = 0 \text{ for all } k \text{ if } B(\theta) = 0 \text{ for all } \theta, \text{ and}$$

$$E[\bar Y_{Rk} - \bar Y_{Fk}] > 0 \text{ for all } k \text{ if } B(\theta) > 0 \text{ for all } \theta. \tag{9}$$

For now we restrict the target ability distributions to be equal for the two groups; i.e.,
$\Theta \mid G = R$ and $\Theta \mid G = F$ have the same distribution. It is easy to prove (following (5))
under the model presented herein that

$$E[\bar Y_{gk}] = E[Y \mid X = k, G = g] = E[T_g(\Theta) \mid X = k, G = g]. \tag{10}$$

Now assume that the valid subtest is long enough so that the distribution of $\Theta \mid X = k$,
$G = g$ is tightly concentrated about its mean, and hence that $T_g(\theta)$ is locally flat within
the range of $\theta$ where the distribution of $\Theta \mid X = k$, $G = g$ mostly resides. Then

$$\begin{aligned} E[T_g(\Theta) \mid X = k, G = g] &\approx T_g(E[\Theta \mid X = k, G = g]) \\ &= T_g(E[\Theta \mid X = k]), \end{aligned} \tag{11}$$

because the two target ability distributions are equal and expectation is a linear operator.
Thus, denoting $\theta_k = E[\Theta \mid X = k]$,

$$E[\bar Y_{Rk} - \bar Y_{Fk}] \approx B(\theta_k). \tag{12}$$

Thus (9) follows easily; the $n + 1$ differences in (8) provide an estimate of $B(\theta)$ at $n + 1$
points in the $\theta$-domain. It is intuitive that an estimate of $\beta_U$ is

$$\hat\beta_U = \sum_{k=0}^{n} \hat p_k (\bar Y_{Rk} - \bar Y_{Fk}) \tag{13}$$

where $\hat p_k$ is the proportion (among focal group examinees) attaining $X = k$. Specifically,
if $J_{gk}$ is the number of examinees in group $g$ attaining $X = k$, then $\hat p_k = J_{Fk} / \sum_{j=0}^{n} J_{Fj}$.
In the case where the target ability distributions are the same, then, it is straightforward
that

$$E[\hat\beta_U] \approx \sum_{k=0}^{n} p_k B(\theta_k) \approx \beta_U \tag{14}$$

where $p_k = P[X = k \mid G = F]$. Thus the expected value of $\hat\beta_U$ is a weighted difference
of marginal IRFs, this weighted difference approximating $\beta_U$, which is a continuously
weighted difference of marginal IRFs. From (14), it follows that $E\hat\beta_U = 0$ if $\beta_U = 0$, and
$E\hat\beta_U > 0$ if $\beta_U > 0$. This suggests the standardized test statistic

$$B = \frac{\hat\beta_U}{\hat\sigma(\hat\beta_U)} \tag{15}$$

for testing $H$, where the denominator is defined as

$$\hat\sigma(\hat\beta_U) = \left( \sum_{k=0}^{n} \hat p_k^2 \left( \frac{1}{J_{Rk}} \hat\sigma^2(Y \mid k, R) + \frac{1}{J_{Fk}} \hat\sigma^2(Y \mid k, F) \right) \right)^{1/2} \tag{16}$$

where $\hat\sigma^2(Y \mid k, g)$ is the sample variance of the studied subtest scores of those group $g$
examinees with valid subtest score $k$. A full description of the computation of the test
statistic, with contingencies for exclusion of certain valid subtest scores based on inadequate
examinee counts, is presented in the Appendix. $B$ is approximately standard normal when
$\beta_U = 0$ and the target ability distributions are the same, because $\hat\beta_U$ is the weighted sum
of approximately normal random variables $\bar Y_{Rk} - \bar Y_{Fk}$; these are approximately normal (for
suitable sample sizes) by the central limit theorem (proof of asymptotic normality of $B$
omitted).
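The computation described above can be sketched as follows, without the regression correction introduced in the next subsection. The function name `sib_uncorrected`, the toy data, and in particular the exclusion rule (keep a valid subtest score only if both groups have at least `min_count` examinees there, then renormalize the focal weights over the retained scores) are assumptions of this sketch; the Appendix rule is not reproduced here.

```python
import math
from collections import defaultdict
from statistics import mean, variance

def sib_uncorrected(ref, foc, min_count=2):
    """Uncorrected estimate, standard error, and standardized statistic.

    ref, foc: lists of (x, y) pairs, where x is the valid subtest score and
    y is the studied subtest score. min_count must be at least 2 so that a
    sample variance exists in every retained cell.
    """
    cells_r, cells_f = defaultdict(list), defaultdict(list)
    for x, y in ref:
        cells_r[x].append(y)
    for x, y in foc:
        cells_f[x].append(y)
    # Assumed exclusion rule, not taken from the Appendix.
    kept = sorted(k for k in cells_f
                  if len(cells_f[k]) >= min_count
                  and len(cells_r.get(k, [])) >= min_count)
    if not kept:
        raise ValueError("no valid subtest score has enough examinees in both groups")
    n_foc = sum(len(cells_f[k]) for k in kept)
    beta_hat, var_hat = 0.0, 0.0
    for k in kept:
        y_r, y_f = cells_r[k], cells_f[k]
        p_k = len(y_f) / n_foc  # focal proportion at score k
        beta_hat += p_k * (mean(y_r) - mean(y_f))
        var_hat += p_k ** 2 * (variance(y_r) / len(y_r) + variance(y_f) / len(y_f))
    sigma_hat = math.sqrt(var_hat)
    if sigma_hat == 0.0:
        raise ValueError("zero estimated variance; the statistic is undefined")
    return beta_hat, sigma_hat, beta_hat / sigma_hat

# Toy data, invented for illustration: (valid score, studied score) pairs.
ref = [(0, 1), (0, 2), (1, 2), (1, 3), (1, 2)]
foc = [(0, 0), (0, 1), (1, 1), (1, 2)]

beta_hat, sigma_hat, b_stat = sib_uncorrected(ref, foc)
# One-sided p-value from the standard normal approximation.
p_value = 0.5 * math.erfc(b_stat / math.sqrt(2.0))
print(beta_hat, sigma_hat, b_stat, p_value)
```

With so few examinees the normal approximation is of course not trustworthy; the toy data serve only to make the arithmetic easy to follow by hand.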

The  regression  correction  for  target  ability  difference 

The  presence  of  a  difference  in  target  ability  distributions  in  test  bias  studies  has  been 
treated  in  various  contexts  in  the  literature.  The  issue  of  the  linking  of  metrics  across  group 
in the estimation of IRT item parameters is one such context (see Linn et al. (1981) for an
IRT  item  bias  approach  where  linking  of  metrics  is  crucial).  Holland  and  Thayer  (1988) 
also  deal  with  this  problem  by  including  the  single  studied  item  in  the  matching  criterion 
score  of  the  Mantel-Haenszel  test;  they  prove  that  this  method  completely  compensates 
for  target  ability  difference  (in  their  context,  the  distributional  difference  in  the  postulated 
unidimensional  latent  trait)  when  the  underlying  IRT  model  is  a  Rasch  model.  Millsap 
and  Meredith  (1989)  elegantly  formulate  the  problem  in  terms  of  a  divergence  of  two 
hypotheses  (a  “conditional  on  observed  score”  hypothesis  and  a  “latent  trait”  hypothesis), 
which  would  occur  if  target  ability  difference  is  present.  A  “conditional  on  observed  score” 
procedure  such  as  (15)  in  its  present  form  is  not  adequate  to  address  the  separation  of 
target  ability  difference  from  test  bias;  the  presence  of  target  ability  difference  when  in 
fact there is no test bias present can statistically inflate $B$, thereby suggesting test bias
actually  is  present.  It  is  therefore  necessary  to  formulate  a  correction  for  target  ability 
difference. 


To motivate the proposed correction it is necessary to show that a decomposition of the
differences $\bar Y_{Rk} - \bar Y_{Fk}$ into “test bias only” and “target ability difference only” components
is possible. First we note that by similar arguments to those used in deriving (10) and (11),

$$E[\bar Y_{gk}] \approx T_g(\theta_{gk}), \tag{17}$$

where $\theta_{gk} = E[\Theta \mid k, g]$. The condition $E[\bar Y_{Rk} - \bar Y_{Fk}] = 0$ requires $\theta_{Rk} = \theta_{Fk}$, as in (11)
where $g$ was removed from the conditioning; but this may not happen if the target ability
distributions are not the same, as Figure 2 suggests. Figure 2, which displays densities
for four distributions, assumes that the distribution of $\Theta \mid F$ is stochastically smaller than
that of $\Theta \mid R$.


place  figure  2  about  here 


Note that the (conditional) distribution of $\Theta \mid k, F$ is stochastically smaller than that
of $\Theta \mid k, R$ for all $k$. The standard Bayesian calculation makes this insight rigorous. Thus,
$\theta_{Fk} < \theta_{Rk}$ for all $k$, and, in the absence of bias, where $T_R(\theta) = T_F(\theta) = T(\theta)$ for all $\theta$,

$$E\bar Y_{Fk} \approx T(\theta_{Fk}) < T(\theta_{Rk}) \approx E\bar Y_{Rk}$$

($T(\theta)$ is assumed monotone; for mild conditions giving such monotonicity, see Shealy and
Stout (1991)). Thus

$$E\hat\beta_U \approx \sum_{k=0}^{n} p_k \left( T(\theta_{Rk}) - T(\theta_{Fk}) \right) > 0.$$
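The inflation can be seen in a small simulation. The sketch below is illustrative only: the Rasch items, the item difficulties, the sample sizes, the one-unit difference in mean target ability, and the helper names `simulate_group` and `beta_hat` are all assumptions. No item depends on a nuisance determinant, so no bias is present, yet the uncorrected focal-weighted estimate comes out clearly positive when the reference group has the higher target ability distribution.

```python
import math
import random
from collections import defaultdict

def simulate_group(theta_mean, n_examinees, valid_b, studied_b, rng):
    """Return (X, Y) pairs for one group.

    Every item is an assumed Rasch item depending on theta only, so the
    simulated test contains no bias.
    """
    pairs = []
    for _ in range(n_examinees):
        theta = rng.gauss(theta_mean, 1.0)
        x = sum(rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) for b in valid_b)
        y = sum(rng.random() < 1.0 / (1.0 + math.exp(-(theta - b))) for b in studied_b)
        pairs.append((x, y))
    return pairs

def beta_hat(ref, foc):
    """Uncorrected estimate: focal-weighted sum of the differences in mean
    studied score, over valid scores observed in both groups."""
    cells_r, cells_f = defaultdict(list), defaultdict(list)
    for x, y in ref:
        cells_r[x].append(y)
    for x, y in foc:
        cells_f[x].append(y)
    kept = [k for k in cells_f if k in cells_r]
    n_foc = sum(len(cells_f[k]) for k in kept)
    return sum(len(cells_f[k]) / n_foc
               * (sum(cells_r[k]) / len(cells_r[k]) - sum(cells_f[k]) / len(cells_f[k]))
               for k in kept)

rng = random.Random(1)
valid_b = [-1.0 + 0.2 * j for j in range(10)]   # 10 valid items
studied_b = [-0.5, -0.25, 0.0, 0.25, 0.5]       # 5 studied items, all unbiased

# Target ability difference only: reference mean 0, focal mean -1.
inflated = beta_hat(simulate_group(0.0, 4000, valid_b, studied_b, rng),
                    simulate_group(-1.0, 4000, valid_b, studied_b, rng))

# Equal target ability distributions: the estimate should be near zero.
equal = beta_hat(simulate_group(0.0, 4000, valid_b, studied_b, rng),
                 simulate_group(0.0, 4000, valid_b, studied_b, rng))

print(round(inflated, 3), round(equal, 3))
```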

In the case where bias is present, we can thus decompose $E[\hat\beta_U]$:

$$\begin{aligned} E[\hat\beta_U] &\approx \sum_{k=0}^{n} p_k \left( T_R(\theta_{Rk}) - T_F(\theta_{Rk}) \right) + \sum_{k=0}^{n} p_k \left( T_F(\theta_{Rk}) - T_F(\theta_{Fk}) \right) \\ &= \sum_{k=0}^{n} p_k B(\theta_{Rk}) + \sum_{k=0}^{n} p_k T_F'(\tilde\theta_k)(\theta_{Rk} - \theta_{Fk}), \end{aligned} \tag{18}$$

where $\tilde\theta_k$ is between $\theta_{Rk}$ and $\theta_{Fk}$. ($T_F(\theta)$ is assumed differentiable here and the mean
value theorem has been applied.) The first term is due only to test bias; the second is due
only to target ability difference.

This approximate decomposition argument is the motivation behind the proposed
correction. Our strategy is to adjust $\bar Y_{Rk}$ and $\bar Y_{Fk}$ to $\bar Y^*_{Rk}$ and $\bar Y^*_{Fk}$ such that the inflating effect of
the group differences in target ability is eliminated. The manner in which this is accomplished is to
construct $\bar Y^*_{Rk}$ and $\bar Y^*_{Fk}$ so that they are estimating the studied subtest response functions
$T_R(\theta)$ and $T_F(\theta)$ at approximately the same target ability $\theta_k$ defined below (as opposed
to two different ones, as is evident from (17)).

A natural attempt to make adjustments to $\bar Y_{Rk}$ and $\bar Y_{Fk}$ is to approximate $T_R(\theta)$ and
$T_F(\theta)$ in the neighborhood of $\theta_{Rk}$ and $\theta_{Fk}$ by linear functions. If we assume that $\theta_{Rk}$ and
$\theta_{Fk}$ are sufficiently close together to do this, $T_R(\theta)$ and $T_F(\theta)$ can be linearly interpolated
at $\theta_k = \frac{1}{2}(\theta_{Rk} + \theta_{Fk})$:

$$T_g(\theta_k) \approx T_g(\theta_{gk}) + m_{gk}(\theta_k - \theta_{gk}) \tag{19}$$

where

$$m_{gk} = \frac{T_g(\theta_{g,k+1}) - T_g(\theta_{g,k-1})}{\theta_{g,k+1} - \theta_{g,k-1}};$$

however, though estimates of $T_g(\theta_{gk})$ (namely, $\bar Y_{gk}$) are available for all $k$, estimates for
$\{\theta_{gk} : k = 0, \ldots, n\}$ are not. Abilities on the $\theta$-scale are not observable; however, one can
estimate abilities on the scale defined by the valid subtest, namely

$$v = P(\theta)$$

where $P(\theta)$ is the average of the valid subtest IRFs $\frac{1}{n} \sum_{i=1}^{n} P_i(\theta)$. $P(\Theta) \mid G = g$ is the
true score for a randomly chosen group $g$ examinee, i.e., the valid subtest true score $P(\Theta)$
for group $g$. Let

$$V_g(x) = E[P(\Theta) \mid X = x, G = g], \tag{20}$$

the (theoretical) regression of true on observed (here, valid) score. $V_g(x)$ can be easily
estimated using classical true score theory, assuming that the above regression is linear or
nearly so. The estimation of $V_g(x)$ is deferred to the Appendix. Denote this estimator by
$\hat V_g(x)$.

At this point it is expedient to describe three latent scales, which must be simultaneously
considered in order to understand the correction. Figure 3 delineates the three
scales and should be referred to frequently.

place figure 3 about here


So, the interpolation of (19) must be transformed so as to use the easily estimable
$V_g(k)$ instead of $\theta_{gk}$. Through a monotonic transformation $P(\theta)$, $V_g(k)$ and $\theta_{gk}$ represent
approximately (“approximately” because $P(\theta_{gk}) \approx V_g(k)$ will be demonstrated below)
the same ability on two different latent scales and are thus for our purposes interchangeable.
Note that $s = T_g(\theta)$ defines a monotonic transformation from the fundamental latent
scale to the studied subtest scale, and $v = P(\theta)$ defines one from the fundamental scale
to the valid subtest scale. $T_g(\theta)$ must be transformed so we can use the valid subtest
scale as domain, because abilities on this scale can be estimated. Figure 4 illustrates the
appropriate correspondence,


place  figure  4  about  here 


thus defining a new transformation $S_g(v) = T_g(P^{-1}(v))$ from valid subtest scale to studied
subtest scale, with domain $(c, 1)$ and range $(c, 1)$ ($c > 0$ is the guessing parameter, assumed
common for all items in the test).

With this transformation in hand, the correction can be performed in the following
manner. First, by the same arguments as used in (10) and (11), using $P(\theta)$ in place of
$T_g(\theta)$ in the arguments,

$$V_g(k) \approx P(\theta_{gk}). \tag{21}$$

So $P^{-1}(V_g(k)) \approx \theta_{gk}$ by continuity; and

$$T_g(P^{-1}(V_g(k))) \approx T_g(\theta_{gk}),$$

also by continuity. By definition of $S_g(v)$, this becomes $S_g(V_g(k)) \approx T_g(\theta_{gk})$, and thus
by (17),

$$E\bar{Y}_{gk} \approx S_g(V_g(k)). \tag{22}$$

Thus $\bar{Y}_{gk}$ is a reasonable estimator of $S_g(V_g(k))$ for each $k$. To transform (19) into
an interpolation involving $S_g(\cdot)$, we assume that $S_g(v)$ can be approximated by a linear
function in a small region about $V_g(k)$, and that $V_R(k)$ and $V_F(k)$ are close enough to
allow the approximation to be effective. Then, we interpolate $S_R(V_R(k))$ and $S_F(V_F(k))$
to their respective values at $V_k = \frac{1}{2}(V_R(k) + V_F(k))$:

$$S_g(V_k) \approx S_g(V_g(k)) + m_{gk}(V_k - V_g(k)), \tag{23}$$

where

$$m_{gk} = \frac{S_g(V_g(k+1)) - S_g(V_g(k-1))}{V_g(k+1) - V_g(k-1)}$$

is the approximate slope of $S_g(v)$ in the region of $V_g(k)$ and $V_k$. All of the above terms on
the right hand side of (23) are estimable; using $\bar{Y}_{gk}$ to estimate $S_g(V_g(k))$, we define the
adjusted $\bar{Y}_{gk}$ by

$$\bar{Y}^*_{gk} = \bar{Y}_{gk} + \hat{M}_{gk}(\hat{V}_k - \hat{V}_g(k)), \tag{24}$$

where, recalling that the estimator $\hat{V}_g(x)$ is given in the Appendix,

$$\hat{M}_{gk} = \frac{\bar{Y}_{g,k+1} - \bar{Y}_{g,k-1}}{\hat{V}_g(k+1) - \hat{V}_g(k-1)},$$

and define $\hat{V}_k = \frac{1}{2}(\hat{V}_R(k) + \hat{V}_F(k))$. Because the right hand side of equation (24) is a good
estimator of the right hand side of (23), $\bar{Y}^*_{gk}$ is thus a good estimator of $S_g(V_k)$. Finally,
$\bar{Y}^*_{gk}$ must be shown to be a good estimator of $T_g(\theta)$ at the same $\theta$ for both groups. By definition
of $S_g(v)$, $S_g(V_k) = T_g(P^{-1}(V_k))$. If $\theta_{Rk}$ and $\theta_{Fk}$ are sufficiently close together then $P(\theta)$
may be taken to be approximately linear in the neighborhood of $\theta_k = (\theta_{Rk} + \theta_{Fk})/2$. Thus,
using (21) and assuming approximate linearity of $P$ in the neighborhood of $\theta_k$,


$$V_k = \tfrac{1}{2}(V_R(k) + V_F(k)) \approx \tfrac{1}{2}(P(\theta_{Rk}) + P(\theta_{Fk})) \approx P(\theta_k).$$

Thus, by the continuity of $P(\theta)$,

$$\theta_k \approx P^{-1}(V_k).$$

Hence, by the definition of $S_g(v)$,

$$S_g(V_k) = T_g(P^{-1}(V_k)) \approx T_g(\theta_k).$$

Thus, because $\bar{Y}^*_{gk}$ has been shown to be a good estimator of $S_g(V_k)$, it is shown that
$\bar{Y}^*_{gk}$ is a good estimator of $T_g(\theta_k)$. Thus, $\bar{Y}^*_{Rk} - \bar{Y}^*_{Fk}$, as desired, is a good estimator of
$T_R(\theta_k) - T_F(\theta_k)$, i.e., of the difference of the marginal IRFs at the same $\theta$, establishing
the usefulness of the interpolation (19).

(24) is called the regression correction for target ability difference. Thus, with the
correction (24) in place, (13) can be reconstructed, with

$$\hat{\beta}_U = \sum_{k=0}^{n} \hat{p}_k(\bar{Y}^*_{Rk} - \bar{Y}^*_{Fk}) \tag{25}$$

and $B$ defined as in (15). Rejection of the hypothesis of no test bias ($H: \beta_U = 0$) occurs
when $B > z_\alpha$, where $P[N(0,1) > z_\alpha] = \alpha$ defines $z_\alpha$. This procedure will be referred to
as the SIB procedure, “SIB” for simultaneous item bias.

Thus, the contribution to the differences $\bar{Y}_{Rk} - \bar{Y}_{Fk}$ due to target ability difference
has been eliminated. It is instructive to note that the correction (24) is the
sample analogue of (23), which is basically the decomposition (19), albeit on a different
latent scale (though the two latent scales, $\theta$ and $V$, are indistinguishable up to a monotonic
transformation).
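
As an illustration only, the correction (24) for interior valid subtest scores can be sketched in a few lines of Python. The function and argument names are hypothetical and are not taken from the authors' code; the example data assume a no-bias situation in which both groups share one linear $S(v)$, so the corrected group difference should vanish even though the uncorrected one does not.

```python
def regression_correction(y_bar, v_hat, v_hat_other):
    """Adjusted means Y*_gk of equation (24) for interior scores k = 1..n-1.

    y_bar[k]       : mean studied subtest score of group g at valid score k
    v_hat[k]       : estimated V_g(k) for group g
    v_hat_other[k] : estimated V_g'(k) for the other group
    """
    n = len(y_bar) - 1
    adjusted = {}
    for k in range(1, n):
        v_k = 0.5 * (v_hat[k] + v_hat_other[k])  # common point V_k
        # central-difference slope estimate M_gk
        slope = (y_bar[k + 1] - y_bar[k - 1]) / (v_hat[k + 1] - v_hat[k - 1])
        adjusted[k] = y_bar[k] + slope * (v_k - v_hat[k])
    return adjusted


# Hypothetical no-bias example: both groups share S(v) = 2v + 0.1,
# but the reference group sits higher on the valid subtest scale.
v_ref = [0.2, 0.3, 0.4, 0.5]
v_foc = [0.1, 0.2, 0.3, 0.4]
y_ref = [2 * v + 0.1 for v in v_ref]
y_foc = [2 * v + 0.1 for v in v_foc]
adj_ref = regression_correction(y_ref, v_ref, v_foc)
adj_foc = regression_correction(y_foc, v_foc, v_ref)
```

In this toy case the raw difference at $k = 1$ is 0.2, while the adjusted means of the two groups coincide, which is the behavior the correction is designed to produce under no bias.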

A  modification  of  the  basic  procedure  to  achieve  better  statistical  behavior 

Redefine $\hat{p}_k$ to be the proportion of all examinees (focal and reference group) attaining
$X = k$. That is, $\hat{p}_k = (J_{Fk} + J_{Rk}) / \sum_{k=0}^{n}(J_{Fk} + J_{Rk})$. Substitute this new $\hat{p}_k$ into (25)
and (16) to obtain the statistic $B$ of (15). Because of a slightly better adherence in
simulation  studies  to  the  nominal  level  of  significance  when  the  hypothesis  of  no  test  bias 
holds,  this  new  choice  of  pk  is  recommended  over  the  slightly  more  intuitive  choice  based 
upon  focal  group  examinees  alone.  The  power  performance  of  both  versions  of  B  when 
test  bias  was  present  was  very  similar.  It  is  upon  this  version  of  the  SIB  statistic  that  our 
simulation  studies  reported  below  are  based. 
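
A minimal sketch of the two weighting choices follows, assuming the counts $J_{Rk}$ and $J_{Fk}$ are held in plain lists indexed by $k$. The function names are hypothetical, and the focal-only version is written under the assumption that it is simply the focal group's score proportions.

```python
def pooled_weights(j_ref, j_foc):
    """Weights based on all examinees: (J_Fk + J_Rk) / sum_k (J_Fk + J_Rk)."""
    total = sum(j_ref) + sum(j_foc)
    return [(r + f) / total for r, f in zip(j_ref, j_foc)]


def focal_weights(j_foc):
    """Weights based on focal group examinees alone (assumed form J_Fk / J_F)."""
    total = sum(j_foc)
    return [f / total for f in j_foc]


# Hypothetical counts for valid subtest scores k = 0, 1
weights = pooled_weights([10, 30], [20, 40])
```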

SIMULATION  STUDY 

In order to assess the performance of the procedure in a variety of testing situations,
a moderate-sized (84 simulation cases) simulation study was performed. Three parameter
logistic item parameters actually estimated from two test data sets, an ACT math test
(estimated by Drasgow (1987)) and an ASVAB auto shop test (estimated by Mislevy and
Bock (1984)), are used to specify the IRFs in the IRT model. Univariate and bivariate
normal ability distributions, appropriately centered relative to the test item parameters
(for the purpose of good measurability of target ability), are used for the focal and reference
groups. Two levels of bias and three levels of target ability difference are simulated; tests
with a single biased item and with three biased items are used in the simulations. The level
of guessing in the tests is varied. Finally, group size pairs of (3000, 3000), (3000, 1000),
and (1500, 1500) for the reference group and focal group examinees respectively are used.

Each  simulation  model  is  run  100  times  (trials).  For  a  particular  simulation  model,  the 
item  parameters  and  the  two  ability  distributions  for  the  two  groups  are  fixed;  however, 
at  each  trial,  a  new  set  of  examinees  (ability  parameters)  is  generated  from  the  ability 
distributions. 

When a single item is to be studied in a simulation, the Mantel-Haenszel procedure as
modified by Holland and Thayer is run in parallel to provide an external reference
against which to compare our procedure.


Item  parameters 


Estimated  item  parameters  from  the  above  mentioned  tests  were  used  to  construct  test 
models;  the  ASVAB  test  length  is  25,  and  the  ACT  test  length  is  40.  Table  1  gives  the  sum¬ 
mary  statistics  for  the  a’s,  b’s,  and  c’s  as  estimated  by  Mislevy  and  Bock  and  by  Drasgow; 
for  the  actual  parameter  values,  see  Mislevy  and  Bock  (1984)  and  Drasgow  (1987). 


place  table  1  here 


The  test  for  each  simulation  was  generated  in  the  following  manner.  Let  N  denote 
test  length  and  nb  the  number  of  items  to  be  studied  for  possible  bias.  First,  nb  was  chosen 
to  be  either  1  or  3.  There  were  two  cases  to  consider. 

1.  No  bias:  unidimensional  items  are  used  for  the  entire  test. 

2.  Bias:  unidimensional  items  are  used  in  the  valid  subtest,  and  2-dimensional  items  are
used  in  the  studied  subtest.

place  table  2  about  here


In the first case, $n_b$ of the $N$ items were chosen randomly to be the studied ones, and
the remainder were used as the valid subtest. In the second case, $n = N - n_b$ items were
chosen at random from either the ASVAB or the ACT test to be the valid subtest, and
the 2-dimensional studied item parameters were chosen according to Table 2. Note that
the studied item guessing parameters are a function of the average and standard deviation
of the guessing parameters on the ASVAB or ACT tests; the studied item a’s and b’s are
the same for both tests.

The IRFs for case 1 (no bias) are

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp(-1.7 a_{i\theta}(\theta - b_{i\theta}))}, \tag{26}$$

where $a_{i\theta}$ and $b_{i\theta}$ are the target discrimination and difficulty for item $i$. In case 2 (bias),
items 1 to $n$ were of the form (26), and items $n + 1$ to $N$ (studied items) had IRFs

$$P_i(\theta, \eta) = c_i + \frac{1 - c_i}{1 + \exp(-1.7(a_{i\theta}(\theta - b_{i\theta}) + a_{i\eta}(\eta - b_{i\eta})))}, \quad i = n + 1, \ldots, N. \tag{27}$$


The final factor in determining the item parameters was whether or not to include guessing;
that is, whether to assume 2PL or 3PL modeling. The presence of guessing is thought
to influence the performance of the procedure. Thus, in some simulation models, the
estimated $c_i$’s from the literature were used in conjunction with (26) and (27); in others,
all $c_i$’s were set to 0, producing a 2PL model. A detailed description of the experimental
design of the simulations follows.
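
For concreteness, the two response functions (26) and (27) can be written as the following Python sketch. The function and parameter names are illustrative choices, not the authors' code; setting `c = 0.0` gives the 2PL case described above.

```python
import math


def irf_unidimensional(theta, a, b, c=0.0):
    """Three parameter logistic IRF of equation (26)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def irf_two_dimensional(theta, eta, a_theta, b_theta, a_eta, b_eta, c=0.0):
    """Two-dimensional IRF of equation (27) for a studied item.

    theta is the target ability and eta the nuisance ability.
    """
    z = a_theta * (theta - b_theta) + a_eta * (eta - b_eta)
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * z))
```

With the nuisance discrimination `a_eta` set to zero, the second function reduces to the first, which mirrors the statement that unbiased items are unidimensional.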


Ability  distributions 

Specifying  the  ability  distributions  involves  choosing  the  five  parameters  determining 
the  bivariate  normal  distributions  for  each  group  in  such  a  way  to  meet  the  following  goals: 

1.  Introduce  a  specified  amount  of  group  difference  between  target  ability  distributions. 

2.  Require  the  test  to  measure  the  target  ability  well,  as  would  be  true  for  any  “good” 
test. 

3.  Introduce  a  specified  amount  of  potential  for  bias  into  the  distributions. 

4. In the case of 2-dimensional studied items (bias case), require that examinee nuisance
abilities be influential in determining the response to the item, e.g., that focal and
reference group examinees have moderate nuisance abilities.

Each goal is elaborated upon separately below. The bivariate distribution for group $g$
($g = R$ or $F$) is denoted

$$\begin{pmatrix} \Theta \\ \eta \end{pmatrix} \Bigm| G = g \;\sim\; N\left( \begin{pmatrix} \mu_{\theta g} \\ \mu_{\eta g} \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), \tag{28}$$

where $\rho = \mathrm{Corr}(\Theta, \eta \mid G = g)$ is taken to be the same for both groups ($\rho$ taken to be
different across groups tends to introduce bidirectional bias, where marginal IRFs in $\theta$ for
the two groups cross; see Shealy (1989)). Note that $\sigma^2(\Theta \mid g)$ and $\sigma^2(\eta \mid g)$ are taken to
be 1 in our study.
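
As a hedged illustration of how examinees might be drawn from (28), the sketch below samples $(\Theta, \eta)$ pairs with unit variances and correlation $\rho$ using only the Python standard library. The function name and the construction of $\eta$ from two independent standard normals are assumptions made for this example, not a description of the authors' simulation code.

```python
import math
import random


def sample_abilities(mu_theta, mu_eta, rho, size, rng):
    """Draw (theta, eta) pairs from a bivariate normal with unit variances."""
    pairs = []
    for _ in range(size):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        theta = mu_theta + z1
        # eta has variance rho^2 + (1 - rho^2) = 1 and covariance rho with theta
        eta = mu_eta + rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        pairs.append((theta, eta))
    return pairs


# Hypothetical group means; rho = .5 as in the study
examinees = sample_abilities(0.5, 0.1, 0.5, 1000, random.Random(0))
```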

Goal 1. We first define target ability difference. We need some notation; let $\alpha_R$ be
the proportion of the entire (conceptual) population of examinees who are reference group
members, and $\alpha_F = 1 - \alpha_R$ be the corresponding proportion for the focal group. (Note:
as $J_R$ and $J_F$ both increase to $\infty$, conceptually, $J_R/(J_R + J_F) \to \alpha_R$ and $J_F/(J_R + J_F) \to \alpha_F$.
Here $J_g$ denotes the number of sampled Group $g$ examinees.) Define

$$d_T = \frac{\mu_{\theta R} - \mu_{\theta F}}{\sigma_{\theta P}} \tag{29}$$

to be the target ability difference between the focal and reference groups, where

$$\sigma^2_{\theta P} = \alpha_R \sigma^2(\Theta \mid R) + \alpha_F \sigma^2(\Theta \mid F). \tag{30}$$

Note that when (28) holds $\sigma^2_{\theta P} = 1$ and thus that $d_T = \mu_{\theta R} - \mu_{\theta F}$. $d_T$ is a quantity
specified in the simulations.

Goal 2. The criterion used to ensure good measurability of $\theta$ by the test is that the
average difficulty ($\bar{b}$) of the valid subtest should be close to the average target ability over
the pooled groups. Specifically, $\mu_{\theta R}$ and $\mu_{\theta F}$ are chosen so that

$$\bar{b} = E[\Theta] = \alpha_R \mu_{\theta R} + \alpha_F \mu_{\theta F}. \tag{31}$$

$\bar{b}$ is taken from Table 1. $\mu_{\theta R}$ and $\mu_{\theta F}$ are completely determined by specification of $d_T$
and (31).

Goal 3. We use a more restrictive version of Definition 1 to define potential for bias: set

$$C_\beta(\theta) = E[\eta \mid \Theta = \theta, G = R] - E[\eta \mid \Theta = \theta, G = F]. \tag{32}$$

$C_\beta(\theta) > 0$ is defined to be the potential for bias against the focal group. When (28) holds,
(32) becomes

$$C_\beta(\theta) = C_\beta = \mu_{\eta R} - \rho\mu_{\theta R} - (\mu_{\eta F} - \rho\mu_{\theta F}) = (\mu_{\eta R} - \mu_{\eta F}) - \rho(\mu_{\theta R} - \mu_{\theta F}) = (\mu_{\eta R} - \mu_{\eta F}) - \rho d_T, \tag{33}$$

$\theta$ dropping out because the ability correlation ($\rho$) is equal for both groups. Note that
because $C_\beta$ is constant for all $\theta$, unidirectional bias is being introduced. For a specified
amount of $C_\beta$, $\mu_{\eta R}$ and $\mu_{\eta F}$ are determined partially. The reader should note that potential
for bias can hold even though $\mu_{\eta R} = \mu_{\eta F}$ unless $\mu_{\theta F} = \mu_{\theta R}$.

Goal 4. The criterion used to ensure nuisance determinant influence is the following. The
nuisance difficulties for all studied items were chosen to be 0. For an arbitrarily chosen
target ability (say $\theta = 0$) we thus want the average nuisance ability to be near 0 as well.
Thus we choose

$$E[\eta \mid \Theta = 0, G = R] = -E[\eta \mid \Theta = 0, G = F], \tag{34}$$

i.e., the conditional nuisance expectation at $\Theta = 0$ is to be centered around the average
studied item nuisance difficulty of 0, for the reference and focal groups. Our intent in this
study was to introduce bias against the focal group, so $E[\eta \mid 0, R] > 0$ in (34) and thus we
get

$$0 < \mu_{\eta R} - \rho\mu_{\theta R} = -(\mu_{\eta F} - \rho\mu_{\theta F}); \tag{35}$$

this will specify $\mu_{\eta R}$ and $\mu_{\eta F}$, along with specification of $C_\beta$ in (33).

There is an additional issue here: how large should $C_\beta$ be chosen to introduce a
“moderate” or “severe” amount of bias into the 2-dimensional studied items of Table 2?
This is treated below, in the experimental design of the study.

Goals 1-4 now completely specify (28): $\mu_{\theta R}$, $\mu_{\theta F}$, $\mu_{\eta R}$, and $\mu_{\eta F}$ can be found by
solving (29), (31), (33), and (35) simultaneously for them. $\rho$, $\sigma^2(\Theta \mid g)$, and $\sigma^2(\eta \mid g)$ are
chosen: $\rho = .5$, and all $\sigma$’s are 1.
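
Solving the four conditions simultaneously is elementary, and a small Python sketch makes the solution explicit. With unit variances, the target ability difference and the centering condition give the two target means, and the nuisance centering condition splits the potential for bias evenly between the groups. The function name, argument names, and the default reference proportion of 0.5 are assumptions introduced for this example.

```python
def group_means(d_t, c_beta, b_bar, alpha_r=0.5, rho=0.5):
    """Solve (29), (31), (33), (35) for the four group means (unit variances).

    Returns (mu_theta_R, mu_theta_F, mu_eta_R, mu_eta_F).
    alpha_r = 0.5 is an assumed default, not a value stated in the text.
    """
    alpha_f = 1.0 - alpha_r
    # (29) and (31): difference d_t and pooled mean b_bar
    mu_theta_r = b_bar + alpha_f * d_t
    mu_theta_f = b_bar - alpha_r * d_t
    # (35): mu_eta_R - rho*mu_theta_R = -(mu_eta_F - rho*mu_theta_F) = h,
    # (33): C_beta = 2h, so h = C_beta / 2
    half = c_beta / 2.0
    mu_eta_r = rho * mu_theta_r + half
    mu_eta_f = rho * mu_theta_f - half
    return mu_theta_r, mu_theta_f, mu_eta_r, mu_eta_f


# Hypothetical specification: d_T = 1, potential for bias 0.2, average difficulty 0
means = group_means(1.0, 0.2, 0.0)
```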

Choice of $C_\beta$

The amount of potential for bias $C_\beta$ in each simulation model was chosen so that the
actual level of bias $\beta_U$ produced was such that the power behavior of the statistic can be
well assessed for the given examinee sample sizes, valid subtest used (recall Table 1), and
biased items used (recall Table 2). These $\beta_U$ values (rounded to two significant figures)
are shown in Table 3. The governing equations determining $C_\beta$ from $\beta_U$ were

$$\beta_U = \int (T_R(\theta) - T_F(\theta)) f_F(\theta)\, d\theta,$$

where

$$T_g(\theta) = \sum_{i=n+1}^{N} E[P_i(\Theta, \eta) \mid \Theta = \theta, G = g] \tag{36}$$

with $P_i(\theta, \eta)$ defined in (27) and the item parameters in (27) defined in Table 2, and the
parameters of the $(\Theta, \eta)$ distribution determined from (29), (31), (33), and (35).

place table 3 about here

One standard often used to interpret from a practitioner’s viewpoint the magnitude of the bias
is that the bias is “moderate” if $0.5 < \Delta_{MH} < 1$ while it is “large” if $\Delta_{MH} > 1$, where
$\Delta_{MH}$ is the theoretical index based on use of the Mantel-Haenszel log odds ratio proposed
by Holland and Thayer (1988). The rationales for $\Delta_{MH}$ and $\beta_U$ are different, but for $n_b = 1$
and unidirectional bias, they tend to be highly correlated and are crudely related by

$$\beta_U = \Delta_{MH}/10.$$


Thus, roughly, $0.05 < \beta_U < 0.1$ would constitute moderate bias while $\beta_U > 0.1$ would
constitute large bias. Thus in the $n_b = 1$ case, referring to Table 4, the amount of bias
being simulated is actually either (low) moderate or small. Examination of (36) shows that
$\beta_U$ is a measure of how much lower the probability of getting the biased item right is for
an average focal group examinee as compared with an average reference group examinee
of the same target ability. Thus $\beta_U$ has a natural and useful empirical interpretation. In
our context, $\Delta_{MH}$, by contrast, is a measure of horizontal distance between $T_R(\theta)$ and
$T_F(\theta)$ at $y = \frac{1 + c}{2}$ (i.e., the value of $T_F^{-1}((1 + c)/2) - T_R^{-1}((1 + c)/2)$), where $c$ is defined
in Table 1.

Experimental design

The design is as follows. For the case of no test bias ($C_\beta = 0$), for each test type
(ASVAB Auto Shop or ACT Math) the following simulations are done:

$$n_b = \begin{Bmatrix} 1 \\ 3 \end{Bmatrix} \times d_T = \begin{Bmatrix} 0.0 \\ 0.5 \\ 1.0 \end{Bmatrix} \times D \begin{Bmatrix} \text{guessing} \\ \text{no guessing} \end{Bmatrix} \times J_R/J_F = \begin{Bmatrix} 3000/3000 \\ 3000/1000 \\ 1500/1500 \end{Bmatrix}$$

Here  “guessing”  means  that  the  estimated  ACT  and  ASVAB  guessing  parameters  are  used 
in  the  model  and  “no  guessing”  means  that  all  cs  are  set  to  zero;  that  is,  2PL  modeling 
is  used.  Also,  “D”  means  that  this  guessing  “factor”  is  randomly  assigned  within  the 
36  levels  produced  by  crossing  the  other  factors. 
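
The count of 36 can be checked by enumerating the crossed factors for the no-bias case, as in the sketch below. The factor levels are those named in the text (two test types, one or three studied items, three levels of target ability difference, three group size pairs); the names and the use of a fair coin for the randomly assigned guessing factor are assumptions made for illustration.

```python
import itertools
import random

TESTS = ["ASVAB Auto Shop", "ACT Math"]
N_B = [1, 3]
D_T = [0.0, 0.5, 1.0]
GROUP_SIZES = [(3000, 3000), (3000, 1000), (1500, 1500)]  # (J_R, J_F)


def no_bias_design(seed=0):
    """Enumerate the crossed no-bias cells and randomly assign guessing."""
    rng = random.Random(seed)
    cells = []
    for test, n_b, d_t, (j_r, j_f) in itertools.product(TESTS, N_B, D_T, GROUP_SIZES):
        cells.append({
            "test": test, "n_b": n_b, "d_T": d_t, "J_R": j_r, "J_F": j_f,
            # guessing is assigned at random rather than crossed (assumed fair coin)
            "guessing": rng.choice([True, False]),
        })
    return cells


design = no_bias_design()
```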

For the case of test bias ($C_\beta > 0$) the following simulations are done for each test type:

$$D \begin{Bmatrix} \text{guessing} \\ \text{no guessing} \end{Bmatrix} \times \begin{Bmatrix} 3000/3000 \\ 3000/1000 \\ 1500/1500 \end{Bmatrix}$$

For $n_b = 1$, the nuisance discrimination $a_{N\eta}$ of the studied item is .8; for $n_b = 3$, the
nuisance discrimination of each of the 3 studied items is .4. These discriminations were
chosen so that the power of the procedure could be well assessed (i.e., so that it would not
be too close to 1). It is informative to note in passing that the power of the procedure
is expected to be greater when $n_b$ is increased from 1 to 3 unless each item individually
displays less bias in the $n_b = 3$ case. This is why the $a_{i\eta}$ ($i = N - 2, N - 1, N$) was chosen
to be .4 in the $n_b = 3$ case, $\frac{1}{2}$ of that used in the $n_b = 1$ case.

There are therefore 48 simulation models that incorporate bias. Thus, a total of
84 simulation models were used in the simulation study.

RESULTS  OF  THE  SIMULATION  STUDY 

The results of the simulation study are given in Tables 5-8 and 9-12, with Tables 5-8
summarizing the no test bias simulations and Tables 9-12 summarizing the simulations
having test bias present. The $c$ column indicates whether the model has guessing present
or not. In all $n_b = 1$ cases, the Mantel-Haenszel rejection rate for the hypothesis of no item
bias (based on 100 trials) is reported in the MH column. In all cases the SIB rejection rate
is reported in the SIB column. In all cases where test bias is present (Tables 9-12), the
$C_\beta$ column presents the amount of potential for bias present (recall (33)); the $\beta_U$ column
presents our index of the amount of bias present against the focal group in the model
(recall (6)); $\bar{\hat{\beta}}_U$ is the average of the estimates $\hat{\beta}_U$ of $\beta_U$ over the 100 trials; the $\Delta_{MH}$
column presents the amount of bias present against the focal group in the model from the
Mantel-Haenszel perspective.

Tables  5-8  indicate  that  both  the  SIB  statistic  and  the  MH  statistic  display  reasonable 
adherence  to  the  nominal  level  of  significance  of  0.05.  There  appear  to  be  situations  of 
no  bias,  which  have  a  target  ability  difference  and  which  depart  from  the  Rasch  model, 
where  the  Mantel-Haenszel  procedure  displays  inflated  Type  1  error.  (See  Zwick  (1990), 
for  a  discussion  of  this  problem  and  an  illustrative  example.)  There  is  evidence  that 
in  such  situations  (Shealy  (1989)),  the  SIB  statistic  adheres  closely  to  the  nominal  level 
of  significance.  On  the  other  hand  there  are  likely  portions  of  the  “parameter  space” 
of realistic IRT models where our linear regression correction is stressed and hence the
MH  would  likely  display  better  Type  1  error  performance.  More  study  is  required  before 
it  can  be  claimed  that  either  MH  or  SIB  displays  superior  Type  1  error  performance. 
The  striking  fact  is  that  both  procedures  seem  to  be  quite  robust  against  the  inflating 
Type  1  error  effect  of  differing  target  ability  distributions.  In  this  regard,  dT  =  1  from  the 
practitioner’s  perspective  is  certainly  a  large  amount  of  target  ability  difference. 

Tables  9  and  11  indicate  that  both  the  SIB  statistic  and  the  MH  statistic  are  quite 
powerful  against  moderate  amounts  of  bias  and  fairly  powerful  against  small  amounts  of 
bias  in  a  single  biased  item.  Untabulated  simulation  studies  for  larger  amounts  of  bias 
produced  rejection  rates  of  essentially  unity  for  both  the  SIB  and  MH  procedures. 

Tables 10 and 12 indicate that the SIB procedure is quite powerful against moderate
amounts of bias resulting from several (3 here) items producing bias in the same direction.
The reader should recall that the amount of bias/item was lowered for the $n_b = 3$ case by
reducing the discrimination in the nuisance dimension from $a_{N\eta} = 0.8$ to $a_{i\eta} = 0.4$ for the
studied items. In both the $n_b = 1$ and $n_b = 3$ cases, the potential for bias as measured
by $C_\beta$ was kept the same ($C_\beta = 0.2$ or 0.3). These two tables show, as claimed, that the
SIB procedure can successfully detect simultaneous item bias, even if the amount of bias
present per item is small.

Tables  9  and  11  show,  for  the  particular  bias  models  of  the  simulation  study,  that  SIB 
is  somewhat  more  powerful  than  MH,  averaging  0.07  higher  for  those  models  for  which 
rejection rates are < 0.9. We do not know whether this greater SIB power generalizes to
other  models  of  bias. 

Tables 9-12 provide evidence about the ability of $\hat{\beta}_U$ to estimate $\beta_U$, our measure of
the amount of bias present. For each case $\bar{\hat{\beta}}_U - \beta_U$ is an indicator of the amount of statistical
bias one might expect in using $\hat{\beta}_U$. Clearly statistical bias of roughly +0.01 is present.
The estimated standard errors for $\hat{\beta}_U$ are not recorded, but averaged (roughly) about 1/3
of $\hat{\beta}_U$. Thus if $\hat{\beta}_U = 0.05$ there is likely a bias of 0.01 and a standard error of 0.017. Thus,
crudely, a 95% confidence interval (if asymptotic normality is a good approximation) would
be given by 0.04 ± 0.028. Here 0.04 = 0.05 - 0.01 is the correction for statistical bias. It
would seem that $\hat{\beta}_U$ provides a useful empirical index of the amount of bias present in a
statistical subtest of items; more work is planned in studying its theoretical and empirical
properties.

SUMMARY  AND  CONCLUSIONS 

The  SIB  procedure  was  designed  to  test  for  unidirectional  test  bias  residing  in  one  or 
more  items,  using  the  conception  that  test  bias  is  incipient  within  the  two  groups’  ability 
distributions  (in  terms  of  a  difference  in  conditional  nuisance  ability  distributions).  By 
means of the regression correction presented here, the inflation of the SIB test statistic
due to target ability difference (one group having a stochastically larger distribution of $\Theta$)
is extracted. This correction represents a conceptual link between conditional-on-observed-score
methods and IRT-based methods, just as the practice of including the studied item
in the comparable examinee criterion in the Mantel-Haenszel procedure of Holland and
Thayer (1988) does. The correction adjusts the studied subtest scores for the two groups so
that they are now estimates of the same latent IRT ability in the case of no test bias, even if
group target ability differences exist. It is useful to note that the adjustment, although conceptually
based  upon  multidimensional  IRT  modeling,  is  in  fact  computed  using  a  classical  approach 
and  hence  does  not  depend  on  IRT  ability  or  item  parameter  estimation. 

A  moderate  (84  models)  simulation  study  shows  that  both  MH  and  SIB  display  good 
adherence  to  the  nominal  level  of  significance,  even  for  large  ( dT  =  1)  target  ability  differ¬ 
ences.  In  the  case  of  a  single  biased  item,  both  MH  and  SIB  display  good  power  with  SIB 
displaying  slightly  higher  power.  As  designed,  the  SIB  statistic  displays  good  power  in  the 
case  of  several  biased  items  (3  here),  even  when  the  amount  of  bias/item  is  fairly  small. 

A large scale simulation study is in progress with the goal of obtaining a better
understanding of the performance characteristics of both the SIB and the MH statistics with
particular emphasis on investigation of statistical power and adherence to the nominal
level of significance. Based upon the completed portion of this simulation study reported
herein, we would recommend that practitioners use the SIB and MH statistics simultaneously.
Both are extremely easy to compute and for moderate sized data sets run quickly on
a typical PC configuration. Carefully checked code with a user oriented driver is available
from the authors for running both the SIB and MH statistics on real data sets and also
for doing simulation studies of performance.

APPENDIX


1.  Derivation  of  Vg(k),  the  estimated  regression  of  true  on  observed  valid 
subtest  score,  for  k  =  0, . . .  ,n. 

Recall that $V_g(k) = E[P(\Theta) \mid k, g]$ needs to be estimated in order for $S_g(V_k)$ of (23)
to be estimated. Suppressing $g$ for simplicity, we need to estimate $V(k)$ at $k = 0, 1, \ldots, n$.
Although $V(k)$ is not necessarily linear in $k$ (see Shealy (1989), p. 87ff for a discussion),
as an approximation we assume $nV(k)$ is linear in $k$; i.e.,

$$nV(k) = \alpha + \beta k.$$

To estimate $V(k)$, we consider the true score model for the valid subtest score $X$:


$$X = T + e, \tag{A1}$$

where

$$E(e) = 0, \quad \mathrm{cov}(T, e) = 0 \tag{A2}$$

is assumed and the true score $T$ has the latent variable representation $T = nP(\Theta)$. Thus

$$nV(k) = E[T \mid k].$$

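Standard regression theory for $E(T \mid k)$ yields

$$V(k) = \frac{1}{n}\left(ET + \rho_{XT}\frac{\sigma_T}{\sigma_X}(k - EX)\right). \tag{A3}$$

But, for the true score model given by (A1) and (A2),

$$\rho_{XT}\frac{\sigma_T}{\sigma_X} = 1 - \frac{\sigma^2(e)}{\sigma^2(X)} \tag{A4}$$

is well known (see page 61 of Lord and Novick (1968)). Using (A1) and (A2), $ET = EX$
holds. Thus, by (A3) and (A4),

$$V(k) = \frac{1}{n}\left(EX + \left(1 - \frac{\sigma^2(e)}{\sigma^2(X)}\right)(k - EX)\right) \tag{A5}$$

holds.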

Clearly $EX = E[X \mid g]$ can be estimated by the average valid subtest score $\bar{X}_g$
of all Group $g$ examinees taking the test. Thus it remains to estimate $\sigma^2(e)/\sigma^2(X)$.
$\sigma^2(X) = \sigma^2(X \mid g)$ can clearly be estimated by the usual sample variance estimate of all
Group $g$ examinees taking the test,

$$\hat{\sigma}^2(X \mid g) = \frac{1}{J_g - 1}\sum_{j=1}^{J_g}(X_{gj} - \bar{X}_g)^2, \tag{A6}$$

where $J_g$ denotes the number of Group $g$ examinees taking the test and $X_{gj}$ is the valid
subtest number correct score of the $j$th such Group $g$ examinee. It remains to estimate
$\sigma^2(e)$; denote this estimator by $\hat{\sigma}^2(e)$. Then the desired estimator of $\sigma^2(e)/\sigma^2(X)$ will be
given by $\hat{\sigma}^2(e)/\hat{\sigma}^2(X)$. A standard conditioning formula yields, indexing the valid subtest
items by $i = 1, 2, \ldots, n$, and setting $X_g = X \mid g$, $\Theta_g = \Theta \mid g$ as a reminder that sampling
here is from Group $g$ only,

$$\sigma^2(X \mid g) = \sigma^2(X_g) = \sigma^2(E[X_g \mid \Theta_g]) + E[\sigma^2(X_g \mid \Theta_g)] = \sigma^2(nP(\Theta_g)) + \sum_{i=1}^{n} E[P_i(\Theta_g)(1 - P_i(\Theta_g))], \tag{A7}$$

using the standard item response theory assumption of local independence of items, given $\Theta$.
Also, by (A2) it is trivial that

$$\sigma^2(X \mid g) = \sigma^2(nP(\Theta) \mid g) + \sigma^2(e \mid g).$$


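Thus, by (A7),

$$\sigma^2(e \mid g) = \sum_{i=1}^{n} E[P_i(\Theta_g)(1 - P_i(\Theta_g))].$$

This suggests

$$\hat{\sigma}^2(e \mid g) = \sum_{i=1}^{n} \hat{U}_{gi}(1 - \hat{U}_{gi}), \tag{A8}$$

where $\hat{U}_{gi}$ is the proportion correct for Group $g$ examinees for valid subtest item $i$. Thus,
using (A5), we will estimate $V_g(k)$ by

$$\hat{V}_g(k) = \frac{1}{n}\left(\bar{X}_g + \left(1 - \frac{\hat{\sigma}^2(e \mid g)}{\hat{\sigma}^2(X \mid g)}\right)(k - \bar{X}_g)\right). \tag{A9}$$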
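
A minimal Python sketch of the estimator (A9) follows, written for one group at a time. The function and argument names are hypothetical, and the sample variance is computed with the $J_g - 1$ divisor, which is an assumption about the form of (A6); the guessing adjustment and the additional factor used in the full procedure of the next section are deliberately left out.

```python
def estimate_v(valid_scores, item_props, n):
    """Estimated regression of true on observed valid subtest score, as in (A9).

    valid_scores : valid subtest number-correct scores of one group's examinees
    item_props   : proportion correct on each of the n valid subtest items
    Returns the list [V_hat(0), ..., V_hat(n)].
    """
    j = len(valid_scores)
    x_bar = sum(valid_scores) / j
    # (A6): sample variance of the valid subtest score (J_g - 1 divisor assumed)
    var_x = sum((x - x_bar) ** 2 for x in valid_scores) / (j - 1)
    # (A8): estimated error variance from item proportions correct
    var_e = sum(u * (1.0 - u) for u in item_props)
    slope = 1.0 - var_e / var_x
    return [(x_bar + slope * (k - x_bar)) / n for k in range(n + 1)]


# Hypothetical data: four examinees on a four-item valid subtest
v_hat = estimate_v([0, 2, 4, 2], [0.5, 0.5, 0.5, 0.5], 4)
```

The estimate shrinks each observed score toward the group mean before rescaling to the proportion-correct metric, so an examinee scoring at the group mean is assigned exactly $\bar{X}_g / n$.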


2.  The  complete  procedure  to  detect  test  bias,  using  the  proposed  regres¬ 
sion  correction. 

The SIB procedure in its entirety is presented here. First we set some basic notation.
Group $g$ ($g = R$ or $F$) has $J_g$ examinees taking the test of $N$ items. The response to item $i$
of the $j$th group $g$ examinee is $U_{gij}$. The subtest scores are

$$X_{gj} = \sum_{i=1}^{n} U_{gij} \ \text{(valid subtest score)}, \qquad Y_{gj} = \sum_{i=n+1}^{N} U_{gij} \ \text{(studied subtest score)}.$$

The classical group item difficulties are $\hat{U}_{gi} = (1/J_g)\sum_{j=1}^{J_g} U_{gij}$. Let $\sum_{j:\,X_{gj}=k}$ denote
summation over those group $g$ examinees $j$ with $k$ correct on the valid subtest.

1.  Compute  Jgk,  the  number  of  group  g  examinees  with  k  correct  on  the  valid  subtest. 

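2. Compute

$$\bar{Y}_{gk} = \frac{1}{J_{gk}}\sum_{j:\,X_{gj}=k} Y_{gj}, \qquad S^2_{gk} = \frac{1}{J_{gk} - 1}\sum_{j:\,X_{gj}=k}(Y_{gj} - \bar{Y}_{gk})^2.$$

If $J_{gk} = 0$, set $\bar{Y}_{gk} = 0$; if $J_{gk} \le 1$, set $S^2_{gk} = 0$. $\bar{Y}_{gk}$ is the sample average studied
subtest score of group $g$ examinees attaining $X_g = k$, and $S^2_{gk}$ is the sample variance.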

3. Compute $\hat{P}_g(k) = J_{gk}/J_g$, for both groups and all $k$. $\hat{P}_g(k)$ is the estimate of the
histogram of $X \mid G = g$. Then compute $P^*_g(k)$, the MLE of the unimodalized histogram
of $X \mid G = g$, over the class of all possible unimodal histograms with $n + 1$
possible values ($X \mid G = g$ is assumed to have a unimodal distribution and hence its
estimate $\{P^*_g(k), k \ge 0\}$ should also be unimodal). For details of this procedure, using
the up-and-down-blocks algorithm, see Barlow et al. (1972; pp. 72-73; pp. 223-231).

4. Set $I(k) = 1$ for all $k$ unless either

(a) $k = 0$ or $n$,

(b) $S^2_{Rk} = 0$ or $S^2_{Fk} = 0$,

(c) $J_R P^*_R(k) < J_{\min}$ or $J_F P^*_F(k) < J_{\min}$, where $J_{\min}$ is set by the user, usually around 30,
or

(d) $k < nc$, where $c \ge 0$ is the user-specified global guessing parameter for the
test. (It is assumed that there is a relatively constant level of guessing across
items, and that there is at least partial knowledge of this guessing value.)

$I(k)$, $k = 0, \ldots, n$, is the examinee inclusion indicator; it is 1 if examinees with
$X = k$ are to have their responses included in the test statistic. (a) excludes the two
extreme valid subtest scores because of their poor estimation of target ability. The
(b) exclusion is obvious. The (c) exclusion is done to assure that each valid subtest
score category has enough examinees to make $\bar{Y}_{Rk}$ and $\bar{Y}_{Fk}$ approximately normal; the
unimodal mass function is used so that only extreme valid subtest score categories are
excluded. As for (d), all valid scores below that expected by guessing are excluded.
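
The four exclusion rules can be collected into a single indicator function, sketched below in Python. The names are hypothetical; `expected_ref[k]` and `expected_foc[k]` stand for the smoothed expected counts used in rule (c) and are assumed to have been computed beforehand, since the unimodal smoothing itself is not reproduced here.

```python
def inclusion_indicator(n, s2_ref, s2_foc, expected_ref, expected_foc,
                        j_min=30, c=0.0):
    """Return [I(0), ..., I(n)] following exclusion rules (a) through (d)."""
    include = []
    for k in range(n + 1):
        excluded = (
            k == 0 or k == n                          # (a) extreme scores
            or s2_ref[k] == 0 or s2_foc[k] == 0       # (b) zero sample variance
            or expected_ref[k] < j_min                # (c) too few reference examinees
            or expected_foc[k] < j_min                #     or too few focal examinees
            or k < n * c                              # (d) below the guessing level
        )
        include.append(0 if excluded else 1)
    return include


# Hypothetical inputs for a four-item valid subtest
flags = inclusion_indicator(
    4,
    s2_ref=[1, 1, 1, 1, 1],
    s2_foc=[1, 1, 0, 1, 1],
    expected_ref=[50, 10, 50, 50, 50],
    expected_foc=[50, 50, 50, 50, 50],
)
```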

5.  Compute  the  regression  of  true  score  on  valid  subtest  score: 

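(a) $\hat u_{gi} = \ldots$ If the result is $< 0$, set it to 0 (adjustment for guessing).

(c) $\hat\sigma^2(X \mid g) = \frac{1}{J_g} \sum_{j=1}^{J_g} (X_{gj} - \bar X_g)^2$

(d) $\hat\sigma^2(e \mid g) = \sum_{i=1}^{n} \hat u_{gi}(1 - \hat u_{gi})$

(e) $\hat b_g = \frac{n}{n-1}\left(1 - \frac{\hat\sigma^2(e \mid g)}{\hat\sigma^2(X \mid g)}\right)$

(f) $\hat V_g(k) = \ldots + \hat b_g(k - \bar X_g)$ for both $g$ and $k = 0, \ldots, n$.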
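
The reliability-based slope and the resulting regression line can be sketched as follows. Several points are assumptions rather than statements of the procedure: the item proportions are taken as given after whatever guessing adjustment step (a) applies, the slope is taken to be the KR-20 form $\frac{n}{n-1}(1 - \hat\sigma^2(e \mid g)/\hat\sigma^2(X \mid g))$, and the regression is taken to be the group mean plus slope times deviation, with any rescaling omitted. The function names are hypothetical.

```python
def kr20_slope(item_props, scores):
    """Return (mean valid score, KR-20 style slope) for one group.

    item_props[i]: guessing-adjusted proportion correct on valid item i (assumed input).
    scores[j]: valid subtest score of examinee j.
    """
    n = len(item_props)
    J = len(scores)
    mean_score = sum(scores) / J
    var_X = sum((x - mean_score) ** 2 for x in scores) / J
    var_e = sum(p * (1 - p) for p in item_props)
    slope = (n / (n - 1)) * (1 - var_e / var_X)
    return mean_score, slope

def true_score_regression(mean_score, slope, n):
    """Regression of true score on valid score k, for k = 0..n (assumed form)."""
    return [mean_score + slope * (k - mean_score) for k in range(n + 1)]
```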

6.  Make  the  regression  correction: 

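(a) $k_\ell = \min\{k : I(k) = 1\}$, $k_r = \max\{k : I(k) = 1\}$.

(b) $\hat V_k = \frac{1}{2}(\hat V_R(k) + \hat V_F(k))$, for $k_\ell \le k \le k_r$.

(c) For $k_\ell < k < k_r$, compute

$$\hat M_{gk} = \frac{\bar Y_{g,k+1} - \bar Y_{g,k-1}}{\hat V_g(k+1) - \hat V_g(k-1)}.$$

Then compute $\bar Y^*_{gk} = \bar Y_{gk} + \hat M_{gk}(\hat V_k - \hat V_g(k))$.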

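(d) For $k = k_\ell$ and $k = k_r$, compute $\bar Y^*_{gk}$ in the following way.

i. Define

$$S_g(v) = \begin{cases} (1 - \pi)\bar Y_{gk} + \pi \bar Y_{g,k+1} & \text{if } \hat V_g(k) \le v \le \hat V_g(k+1), \\ \bar Y_{g0} & \text{if } v < \hat V_g(0), \\ \bar Y_{gn} & \text{if } v > \hat V_g(n), \end{cases}$$

and

$$\pi = \frac{v - \hat V_g(k)}{\hat V_g(k+1) - \hat V_g(k)}.$$

$S_g(v)$ is the linear interpolation of $\{\bar Y_{g0}, \ldots, \bar Y_{gn}\}$.

ii. Compute

$$\bar Y^*_{gk} = S_g(\hat V_k)$$

for $k = k_\ell$ and $k = k_r$.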
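
The regression correction of this step can be sketched for one group as follows. The sketch assumes the regression estimates for the group are strictly increasing in $k$ (so no denominator is zero) and uses hypothetical names throughout: interior scores use the local slope adjustment, and the two end scores use linear interpolation.

```python
def regression_correction(Y_bar, V_g, V_avg, k_lo, k_hi):
    """Return {k: adjusted mean} for k_lo <= k <= k_hi, for one group.

    Y_bar[k]: mean studied-subtest score at valid score k (k = 0..n).
    V_g[k]:   this group's regression estimate at k (assumed increasing).
    V_avg[k]: average of the two groups' regression estimates at k.
    """
    n = len(Y_bar) - 1

    def interpolate(v):
        # Piecewise-linear interpolation through (V_g[k], Y_bar[k]),
        # held constant outside the range of V_g.
        if v <= V_g[0]:
            return Y_bar[0]
        if v >= V_g[n]:
            return Y_bar[n]
        for k in range(n):
            if V_g[k] <= v <= V_g[k + 1]:
                w = (v - V_g[k]) / (V_g[k + 1] - V_g[k])
                return (1 - w) * Y_bar[k] + w * Y_bar[k + 1]

    adjusted = {}
    for k in range(k_lo + 1, k_hi):  # interior scores
        slope = (Y_bar[k + 1] - Y_bar[k - 1]) / (V_g[k + 1] - V_g[k - 1])
        adjusted[k] = Y_bar[k] + slope * (V_avg[k] - V_g[k])
    for k in (k_lo, k_hi):           # end scores
        adjusted[k] = interpolate(V_avg[k])
    return adjusted
```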

7.  Compute  the  bias  statistic. 

(a) Compute $J^*_g = \sum_{k=0}^{n} I(k) J_{gk}$, the number of included group $g$ examinees.

(b)  Compute 

„  gu  -  n  t)/(t) 

(ELo^(SL  +  sK)f(‘))1/2' 

(c) Reject $H : \beta_U = 0$ in favor of $\beta_U > 0$ at level $\alpha$ if $B > z_\alpha$, where
$P[N(0,1) > z_\alpha] = \alpha$ defines $z_\alpha$.
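
Parts (a) and (c) of this step are simple enough to sketch directly. The helper names below are hypothetical, and the statistic $B$ is taken as an input rather than computed.

```python
from statistics import NormalDist

def included_count(indicator, J_gk):
    """Number of included examinees in one group: sum over k of I(k) * J_gk."""
    return sum(i * j for i, j in zip(indicator, J_gk))

def one_sided_test(B, alpha=0.05):
    """Return (reject, z_alpha) for the upper-tailed standard normal test."""
    # z_alpha is the upper alpha quantile of N(0, 1).
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return B > z_alpha, z_alpha
```

At $\alpha = 0.05$ the critical value is approximately 1.645, so a statistic of 2.0 leads to rejection and a statistic of 1.0 does not.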
Figure 1. Stochastically ordered and unordered pairs of distributions.

Figure 2. Prior and posterior target ability distributions.

Figure 3. The three latent scales (axis label: fundamental latent scale).

Figure 4. The valid subtest to studied subtest transformation.


Table 1: Means and standard deviations for the ASVAB and ACT item parameters used in the study.


Test 

5 

8 

b 

c 

■ 

N 

ASVAB  auto/shop 

m  | 

0.7 

0.09 

IS 

uni 

25 

ACT  math 

Wroa 

0.35 

0.5 

ESI 

B3 

EES 

40 

Table 2: Item parameters for the 2-dimensional studied items in the bias case.

[?]   Item No.   [?]    [?]    [?]    [?]    [?]
1     N          1.0    0.0    0.8    0.0    [?]
3     N-2        0.6    -0.3   0.4    0.0    [?]
      N-1        0.8    0.0    0.4    0.0    [?]
      N          1.0    0.3    0.4    0.0    [?]

Table  3:  Equivalence  table  for  bias  potential  and  actual  test  bias. 


E29 

- 

liM 

0.8 

0.05 

3 

- 

3 

lifl 

0.06 

3 

0.4 

0.09 

Table 4: Equivalence of $\Delta_{MH}$ and $\beta_U$ for one studied item, using the item parameters of Table 2.

$c_\beta$   c's used     $\Delta_{MH}$   $\beta_U$
0.0         -            0               0
0.2         0.0          .27             0.034
0.2         actual c's   .27             0.026
0.3         0.0          .40             0.051
0.3         actual c's   .39             0.039

Table 5: No bias, ACT, one studied item, $\alpha = 0.05$.

$J_F$   $J_R$   [?]   $d_T$   MH    SIB
1500    1500    [?]   .0      .03   .07
1000    3000    [?]   .0      .00   .02
3000    3000    [?]   .0      .09   .06
1500    1500    [?]   .5      .04   .04
1000    3000    [?]   .5      .10   .10
3000    3000    [?]   .5      .05   .03
1500    1500    [?]   1.0     .02   .05
1000    3000    [?]   1.0     .05   .10
3000    3000    [?]   1.0     .06   .09

Table 6: No bias, ACT, three studied items, $\alpha = 0.05$.

$J_F$   $J_R$   [?]   $d_T$   SIB
1500    1500    [?]   [?]     .05
1000    3000    [?]   [?]     .02
3000    3000    [?]   [?]     .07
1500    1500    [?]   .5      .08
1000    3000    [?]   .5      .07
3000    3000    [?]   .5      .05
1500    1500    [?]   1.0     .06
1000    3000    [?]   1.0     .16
3000    3000    [?]   1.0     .09

Table 7: No bias, ASVAB, one studied item, $\alpha = 0.05$.

$J_F$   $J_R$   MH    SIB
1500    1500    .08   .07
1000    3000    .04   .04
3000    3000    .06   .06
1500    1500    .13   .14
1000    3000    .04   .03
3000    3000    .05   .04
1500    1500    .07   .02
1000    3000    .15   .09
3000    3000    .11   .01

Table 8: No bias, ASVAB, three studied items, $\alpha = 0.05$.

$J_F$   $J_R$
1500    1500
1000    3000
3000    3000
1500    1500
1000    3000
3000    3000
1500    1500
1000    3000
3000    3000

Table 9: Bias, $a_\eta = 0.8$, ACT, one studied item, $\alpha = 0.05$.

$J_F$   $J_R$   c     $d_T$   $c_\beta$   $\beta_U$   $\hat\beta_U$   $\Delta_{MH}$   MH    SIB
1500    1500    [?]   [?]     .2          .026        .032            .27             .46   .58
1000    3000    [?]   [?]     [?]         .032        .042            .27             .64   .70
3000    3000    [?]   [?]     [?]         .032        .035            .27             .91   .95
1500    1500    [?]   [?]     [?]         .02[?]      .035            .27             .51   .60
1000    3000    [?]   .5      .2          .034        .044            .27             .65   .72
3000    3000    [?]   .5      [?]         .034        .038            .27             .91   .94
1500    1500    [?]   0       [?]         .048        .052            .40             .84   .90
1000    3000    [?]   [?]     .3          .042        .053            .40             .87   .91
3000    3000    [?]   [?]     .3          .042        .045            .40             .97   1.00
1500    1500    [?]   [?]     .3          .050        .047            .40             .99   .99
1000    3000    [?]   .5      .3          .042        .054            .40             .80   .84
3000    3000    [?]   [?]     .3          .042        .064            .40             .91   .92

Table 10: Bias, $a_\eta = 0.4$, ACT, three studied items, $\alpha = 0.05$.

$J_F$   $J_R$   $\beta_U$   $\hat\beta_U$   SIB
1500    1500    .063        .069            .70
1000    3000    .053        .067            .68
3000    3000    .053        .053            .80
1500    1500    .055        .071            .60
1000    3000    .065        .083            .72
3000    3000    .065        .074            .96
1500    1500    .093        .095            .91
1000    3000    .093        .11             .89
3000    3000    .080        .081            .99
1500    1500    .097        .12             .97
1000    3000    .084        [?]             .89
3000    3000    .083        .09             1.00

Table 11: Bias, $a_\eta = 0.8$, ASVAB, one studied item, $\alpha = 0.05$.

$J_F$   $J_R$   c     $d_T$   $c_\beta$   $\beta_U$   $\hat\beta_U$   $\Delta_{MH}$   MH    SIB
1500    1500    [?]   0       .2          .026        .029            .27             .42   .50
1000    3000    [?]   0       .2          .034        .039            .27             .63   .79
3000    3000    [?]   0       [?]         .034        .034            .27             .90   .95
1500    1500    [?]   [?]     .2          .027        .035            .27             .63   .66
1000    3000    [?]   [?]     [?]         .034        .038            .27             .63   .70
3000    3000    [?]   .5      .2          .034        .036            .27             .89   .91
1500    1500    [?]   0       .3          .051        .052            .40             .85   .92
1000    3000    [?]   [?]     .3          .042        .044            .40             .77   .84
3000    3000    [?]   0       .3          .042        .046            .40             .99   .99
1500    1500    [?]   [?]     [?]         .051        .057            .40             .91   .93
1000    3000    [?]   .5      .3          .038        .048            .40             .77   .82
3000    3000    [?]   .5      .3          .039        .045            .40             .94   .97

Table 12: Bias, $a_\eta = 0.4$, ASVAB, three studied items, $\alpha = 0.05$.

$J_F$   $J_R$   [?]   $d_T$   $c_\beta$   $\beta_U$   $\hat\beta_U$   SIB
1500    1500    [?]   0       .2          .065        .067            .70
1000    3000    [?]   [?]     .2          .052        .056            .53
3000    3000    [?]   [?]     [?]         .052        .053            .85
1500    1500    [?]   .5      .2          .052        .068            .63
1000    3000    [?]   .5      .2          .064        .083            .73
3000    3000    [?]   [?]     .2          .064        .072            .92
1500    1500    [?]   [?]     .3          .098        .10             .94
1000    3000    [?]   [?]     .3          .097        .10             .97
3000    3000    [?]   [?]     [?]         .079        .079            .98
1500    1500    [?]   [?]     [?]         .097        .011            .98
1000    3000    [?]   [?]     .3          .076        .098            .87
3000    3000    [?]   [?]     [?]         .078        .090            .99